This dataset contains loan information for approx. 114,000 loans. This analysis is designed to answer meaningful questions about the loan data in question and perform visualizations and aggregations in order to answer those questions. The analysis is as follow:
## ListingNumber ListingCreationDate CreditGrade Term
## 1 193129 2007-08-26 19:09:29.263000000 C 36
## 2 1209647 2014-02-27 08:28:07.900000000 36
## 3 81716 2007-01-05 15:00:47.090000000 HR 36
## 4 658116 2012-10-22 11:02:35.010000000 36
## ListingKey EmploymentStatus EstimatedLoss
## 17A93590655669644DB4C06: 6 Employed :67322 Min. :0.005
## 349D3587495831350F0F648: 4 Full-time :26355 1st Qu.:0.042
## 47C1359638497431975670B: 4 Self-employed: 6134 Median :0.072
## 8474358854651984137201C: 4 Not available: 5347 Mean :0.080
## DE8535960513435199406CE: 4 Other : 3806 3rd Qu.:0.112
## 04C13599434217079754AEE: 3 : 2255 Max. :0.366
## (Other) :113912 (Other) : 2718 NA's :29084
## BorrowerAPR LoanStatus MonthlyLoanPayment
## Min. :0.00653 Current :56576 Min. : 0.0
## 1st Qu.:0.15629 Completed :38074 1st Qu.: 131.6
## Median :0.20976 Chargedoff :11992 Median : 217.7
## Mean :0.21883 Defaulted : 5018 Mean : 272.5
## 3rd Qu.:0.28381 Past Due (1-15 days) : 806 3rd Qu.: 371.6
## Max. :0.51229 Past Due (31-60 days): 363 Max. :2251.5
## NA's :25 (Other) : 1108
## LoanOriginalAmount LoanCurrentDaysDelinquent IncomeRange
## Min. : 1000 Min. : 0.0 $25,000-49,999:32192
## 1st Qu.: 4000 1st Qu.: 0.0 $50,000-74,999:31050
## Median : 6500 Median : 0.0 $100,000+ :17337
## Mean : 8337 Mean : 152.8 $75,000-99,999:16916
## 3rd Qu.:12000 3rd Qu.: 0.0 Not displayed : 7741
## Max. :35000 Max. :2704.0 $1-24,999 : 7274
## (Other) : 1427
## DebtToIncomeRatio AmountDelinquent EstimatedReturn
## Min. : 0.000 Min. : 0.0 Min. :-0.183
## 1st Qu.: 0.140 1st Qu.: 0.0 1st Qu.: 0.074
## Median : 0.220 Median : 0.0 Median : 0.092
## Mean : 0.276 Mean : 984.5 Mean : 0.096
## 3rd Qu.: 0.320 3rd Qu.: 0.0 3rd Qu.: 0.117
## Max. :10.010 Max. :463881.0 Max. : 0.284
## NA's :8554 NA's :7622 NA's :29084
Most distributions are right-skewed with the exceptions of Estimated Return and Borrower APR. There are also are large amount of 0 values present in Amount Delinuqent and Loan Current Days Delinquent.
## # A tibble: 9 x 4
## EmploymentStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 "" 1041. 116. 159.
## 2 Employed 2252. 276. 310.
## 3 Full-time 1563. 163. 217.
## 4 Not available 1048. 126. 191.
## 5 Not employed 1086. 169. 183.
## 6 Other 1035. 173. 232.
## 7 Part-time 984. 106. 141.
## 8 Retired 1046. 131. 168.
## 9 Self-employed 1380. 238. 283.
The employment category that has the highest average and maximum monthly payment is Employed.
## # A tibble: 8 x 4
## IncomeRange max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 $0 33134 0 473.
## 2 $1-24,999 160039 0 658.
## 3 $100,000+ 463881 0 1242.
## 4 $25,000-49,999 444745 0 897.
## 5 $50,000-74,999 284169 0 909.
## 6 $75,000-99,999 265084 0 1204.
## 7 Not displayed 10574 0 419.
## 8 Not employed 91554 0 657.
The largest delinquent amount in terms of both average and max is 100K+ which is surprising though creating a percentage of delinquency amount using income range may tell a different story.
## # A tibble: 12 x 4
## LoanStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Cancelled 0.34 0.152 0.165
## 2 Chargedoff 10.0 0.21 0.339
## 3 Completed 10.0 0.19 0.264
## 4 Current 10.0 0.23 0.262
## 5 Defaulted 10.0 0.22 0.369
## 6 FinalPaymentInProgress 0.570 0.2 0.227
## 7 Past Due (>120 days) 0.62 0.225 0.278
## 8 Past Due (1-15 days) 6.49 0.23 0.265
## 9 Past Due (16-30 days) 10.0 0.25 0.339
## 10 Past Due (31-60 days) 1.19 0.22 0.256
## 11 Past Due (61-90 days) 10.0 0.22 0.376
## 12 Past Due (91-120 days) 3.17 0.24 0.292
The loan status with the highest average debt to income ratio is Past Due at 0.376. Several categories are tied with the same max value (10.01) which is consistent with the description of the max capped value for the Debt to Income Ratio variable. These loan status are Charged Off, Completed, Current, Defaulted, and the previously mentioned past due.
## # A tibble: 7 x 4
## EmploymentStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Employed 0.224 0.0905 0.0947
## 2 Full-time 0.267 0.102 0.103
## 3 Not employed 0.226 0.122 0.119
## 4 Other 0.176 0.099 0.0991
## 5 Part-time 0.184 0.110 0.106
## 6 Retired 0.203 0.112 0.106
## 7 Self-employed 0.284 0.0936 0.0973
The employment status with the highest max estimated returned value is Self-Employed with 0.28. However the highest average value is not-employed. This is also true for median. This situation is due to outliers in the self-employed category, which are more extreme on the positive side than not-employed, even though not-employed’s category’s median and average are larger.
The category with the strongest negative correlation is self-employed which appears as the most linear line in this scatterplot.
## # A tibble: 7 x 4
## IncomeRange max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 $0 0.170 0.124 0.116
## 2 $1-24,999 0.226 0.112 0.109
## 3 $100,000+ 0.267 0.0827 0.0879
## 4 $25,000-49,999 0.257 0.101 0.102
## 5 $50,000-74,999 0.284 0.0903 0.0948
## 6 $75,000-99,999 0.257 0.0872 0.0920
## 7 Not employed 0.226 0.122 0.119
The income range with the maximum estimated return is the 50K to 74.9K range. However the income range with highest median value is 0 dollars and the highest average value is Not employed. This could be due to non-employed people marking their income as 0 dollars, but more information is needed to come to that conclusion.
## # A tibble: 9 x 4
## EmploymentStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 "" 2704 0 344.
## 2 Employed 1222 0 26.9
## 3 Full-time 2497 0 378.
## 4 Not available 2599 0 497.
## 5 Not employed 2367 0 189.
## 6 Other 1052 0 37.8
## 7 Part-time 2389 0 329.
## 8 Retired 2394 0 382.
## 9 Self-employed 2421 0 202.
The category with the highest maximum in delinquency is uncategorized and the highest average value is not available. This could be that the uncategorized employment status could be not available but more information is needed to make that conclusion.
## # A tibble: 12 x 4
## LoanStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Cancelled 111. 39.1 61.5
## 2 Chargedoff 1553. 174. 235.
## 3 Completed 2252. 171. 219.
## 4 Current 1781. 290. 320.
## 5 Defaulted 1103. 168. 233.
## 6 FinalPaymentInProgress 2164. 218. 298.
## 7 Past Due (>120 days) 578. 268. 281.
## 8 Past Due (1-15 days) 1385. 238. 285.
## 9 Past Due (16-30 days) 1278. 201. 277.
## 10 Past Due (31-60 days) 1237. 220. 282.
## 11 Past Due (61-90 days) 904. 201. 258.
## 12 Past Due (91-120 days) 884. 205. 263.
The loan status with the highest max monthly payment is completed while the category with the highest median and average monthly payment is current. This is due to the extremity of the outliers for the completed category, which could be accounted by people paying off their loan balances.
## # A tibble: 8 x 4
## IncomeRange max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 $0 1131. 170. 267.
## 2 $1-24,999 1048. 134. 155.
## 3 $100,000+ 2252. 375. 412.
## 4 $25,000-49,999 1382. 174. 210.
## 5 $50,000-74,999 1778. 253. 280.
## 6 $75,000-99,999 2112. 302. 329.
## 7 Not displayed 1048. 122. 182.
## 8 Not employed 1086. 170. 184.
The income range with the highest monthly payment is the 100K+ range which makes sense given that people with larger incomes can afford larger loans.
## # A tibble: 11 x 4
## LoanStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Chargedoff 0.284 0.125 0.123
## 2 Completed 0.267 0.107 0.102
## 3 Current 0.176 0.0863 0.0902
## 4 Defaulted 0.254 0.127 0.123
## 5 FinalPaymentInProgress 0.170 0.0922 0.0933
## 6 Past Due (>120 days) 0.149 0.120 0.119
## 7 Past Due (1-15 days) 0.176 0.111 0.111
## 8 Past Due (16-30 days) 0.159 0.112 0.111
## 9 Past Due (31-60 days) 0.173 0.112 0.111
## 10 Past Due (61-90 days) 0.170 0.114 0.114
## 11 Past Due (91-120 days) 0.176 0.115 0.115
The Loan Status with the maximum rate of return is Charged off, but the status with the highest median and average rate of returns is Defaulted. This is due to large outliers in the charged off category.
## # A tibble: 12 x 4
## LoanStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Cancelled 0.256 0.207 0.206
## 2 Chargedoff 0.462 0.263 0.258
## 3 Completed 0.512 0.195 0.209
## 4 Current 0.358 0.205 0.214
## 5 Defaulted 0.506 0.240 0.239
## 6 FinalPaymentInProgress 0.358 0.224 0.230
## 7 Past Due (>120 days) 0.358 0.287 0.284
## 8 Past Due (1-15 days) 0.381 0.267 0.264
## 9 Past Due (16-30 days) 0.358 0.273 0.268
## 10 Past Due (31-60 days) 0.358 0.273 0.266
## 11 Past Due (61-90 days) 0.358 0.275 0.273
## 12 Past Due (91-120 days) 0.385 0.282 0.271
The loan status with the highest maximum APR is completed, however the category with highest median and average value Past Due (>120 days). This is due to the extremes of the completed category being more that the Past Due (>120 days).
## # A tibble: 7 x 4
## IncomeRange max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 $1-24,999 10.0 0.32 0.737
## 2 $100,000+ 10.0 0.17 0.181
## 3 $25,000-49,999 7.9 0.26 0.279
## 4 $50,000-74,999 10.0 0.23 0.246
## 5 $75,000-99,999 2.55 0.2 0.214
## 6 Not displayed 10.0 0.16 0.297
## 7 Not employed 10.0 0.295 3.33
All but two of the income range categories reach the maximum debt to income ratio value, which is due to the ceiling placed on the maximum value. So the median and average values are more meaningful. The category with the highest median values is 1-24,99, and the category with the highest average debt to income ratio is Not employed. Which makes sense given the lower income brackets.
## # A tibble: 7 x 4
## EmploymentStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Employed 0.224 0.0905 0.0947
## 2 Full-time 0.267 0.102 0.103
## 3 Not employed 0.226 0.122 0.119
## 4 Other 0.176 0.099 0.0991
## 5 Part-time 0.184 0.110 0.106
## 6 Retired 0.203 0.112 0.106
## 7 Self-employed 0.284 0.0936 0.0973
The employment status with the maximum rate of return is Self-Employed, while the highest median and average values is from not employed. This is due to the extremity of the outliers of self-employed, which also has a large number of outliers as well.
## # A tibble: 9 x 4
## EmploymentStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 "" 1041. 116. 159.
## 2 Employed 2252. 276. 310.
## 3 Full-time 1563. 163. 217.
## 4 Not available 1048. 126. 191.
## 5 Not employed 1086. 169. 183.
## 6 Other 1035. 173. 232.
## 7 Part-time 984. 106. 141.
## 8 Retired 1046. 131. 168.
## 9 Self-employed 1380. 238. 283.
The employment status with the highest maximum, median, and average monthly payment is Employed. Which makes sense given that category represents clients that is more likely to make regular monthly payments.
## # A tibble: 11 x 4
## LoanStatus max med avg
## <fct> <dbl> <dbl> <dbl>
## 1 Chargedoff 0.366 0.112 0.116
## 2 Completed 0.366 0.085 0.0868
## 3 Current 0.203 0.0649 0.0734
## 4 Defaulted 0.366 0.112 0.112
## 5 FinalPaymentInProgress 0.183 0.0774 0.0833
## 6 Past Due (>120 days) 0.165 0.106 0.109
## 7 Past Due (1-15 days) 0.183 0.0975 0.0980
## 8 Past Due (16-30 days) 0.195 0.0975 0.101
## 9 Past Due (31-60 days) 0.203 0.099 0.0996
## 10 Past Due (61-90 days) 0.183 0.0975 0.103
## 11 Past Due (91-120 days) 0.168 0.102 0.101
There are three loan statuses with the maximum estimated loss, which are charged off, completed, and defaulted. Two categories are also tied for the highest median value with are charged off and defaulted. Though charged off has the sole highest average value.
The following are highlights from the visualizations used in this assignment and what was learned during their creation.
In this scatter matrix not only are there several variables that are strongly correlated with each other, but that there were also interesting scatterplot patterns that were worth examining further.
In this scatterplot of Estimated Return vs. Monthly Loan Payment instead of descending to the lowest value, the data points seem to gather around 0.05 horizontally and are more noisy towards 0. This is a unique data point pattern which lead to a negative correlation.
For the boxplot summaries, it showed that meaningful information can be found by grouping data together and that NA data needed to be removed in order to reduce the amount of errors present in the final visualizations.
The Prosper loan dataset was a large rich dataset with decent amount of variables that describe the the loans taken out on their platform. There were issues surrounding missing data and distributions with a lot of outliers. So learning how to handle bad data and use logarithmic axes were key to completing this project. Future questions could a combination of Monthly Payment and Borrower APR could predict income level, could any other combination of continuous variables predict a given factor? This could lead to classification model possiblities with this dataset.